Note: In this chapter we learn about sampling distributions.

  1. First, we will work through a simple activity as an example.
  2. Sections named “DETOUR #” introduce some brand name distributions.

Let’s begin…

1 Sampling Distribution of the sample proportion

1.1 What proportion of this bowl’s balls are red?

Take a look at the bowl in the following Figure. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand as there does not seem to be any particular pattern to the spatial distribution of red and white balls.

Let’s now ask ourselves, what proportion of this bowl’s balls are red?

One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However this would be a long and tedious process.

Instead, suppose we insert a shovel into the bowl and remove a sample of balls. Observe that ____ of the balls are red and there are a total of ____ balls and thus ___ % of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count, our guess of ___% took much less time and energy to obtain.

However, say we started this activity over from the beginning. In other words, we return the 50 balls to the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe.

What if we repeated this exercise several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not.

Let’s try this on the computer…

To this end, we use a data frame bowl in the moderndive package whose rows correspond exactly with the contents of the actual bowl.

head(bowl)
# A tibble: 6 x 2
  ball_ID color
    <int> <chr>
1       1 white
2       2 white
3       3 white
4       4 red  
5       5 white
6       6 white
# View(bowl) # Use this in the console

Observe in the output that bowl has ___ rows, telling us that the bowl contains ___ equally-sized balls. The first variable ball_ID is used merely as an “identification variable”; none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white.

Now that we have a virtual analogue of our bowl, we need a virtual analogue of the shovel seen in Figure 2; we’ll use this virtual shovel to generate our virtual random samples of 50 balls. We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size n. Run the following and explore.

virtual_shovel <- bowl %>% 
  rep_sample_n(size = 50)

virtual_shovel
# A tibble: 50 x 3
# Groups:   replicate [1]
   replicate ball_ID color
       <int>   <int> <chr>
 1         1     136 red  
 2         1     673 white
 3         1    1819 red  
 4         1     627 red  
 5         1     438 white
 6         1     510 white
 7         1     628 white
 8         1    1473 red  
 9         1     870 white
10         1    1636 white
# … with 40 more rows

Next, we can find out how many red balls there are in our virtual_shovel.

virtual_shovel %>% 
  summarize(num_red = sum(color=="red"))  
# A tibble: 1 x 2
  replicate num_red
      <int>   <int>
1         1      23

How about the proportion of red balls? We can use the mutate() function to create a new variable, in this case prop_red.

virtual_shovel %>% 
  summarize(num_red = sum(color == "red")) %>% 
  mutate(prop_red = num_red / 50)
# A tibble: 1 x 3
  replicate num_red prop_red
      <int>   <int>    <dbl>
1         1      23     0.46
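
As a cross-check on the pipeline above, the same count-then-divide step can be sketched in base R. This is only an illustration: toy_bowl is a hypothetical stand-in for bowl with an arbitrarily chosen number of red balls, and sample() plays the role of the virtual shovel.

```r
# Hypothetical stand-in for the bowl: 2400 balls, of which an arbitrary
# (assumed) 960 are red. This is NOT the composition of the real bowl.
set.seed(1)
toy_bowl <- data.frame(
  ball_ID = 1:2400,
  color   = sample(rep(c("red", "white"), times = c(960, 1440)))
)

# One use of the virtual shovel: 50 distinct balls, drawn without replacement
toy_shovel <- toy_bowl[sample(nrow(toy_bowl), size = 50), ]

# Count the red balls, then divide by the sample size
num_red  <- sum(toy_shovel$color == "red")
prop_red <- num_red / nrow(toy_shovel)

# mean() of a logical vector gives the same proportion in one step
prop_red_direct <- mean(toy_shovel$color == "red")
```

Dividing by nrow(toy_shovel) rather than a hard-coded 50 keeps the calculation correct if the shovel size changes.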

1.2 Using the virtual shovel many times

virtual_samples <- bowl %>% 
  rep_sample_n(size = 50, reps = 30)

kable(virtual_samples)
replicate ball_ID color
1 2313 red
1 1783 red
1 1096 white
1 2095 white
1 2324 white
1 1765 red
1 741 white
1 668 red
1 980 white
1 1292 red
1 2030 white
1 1225 white
1 1078 white
1 1873 red
1 1525 white
1 1799 white
1 1706 white
1 1128 white
1 5 white
1 111 white
1 331 red
1 2392 red
1 1804 red
1 946 white
1 2042 red
1 2058 red
1 723 white
1 1808 red
1 1531 red
1 1759 white
1 1809 white
1 1219 white
1 501 red
1 1552 white
1 2369 red
1 876 red
1 27 red
1 1592 white
1 78 white
1 1630 white
1 1342 white
1 2342 red
1 1774 red
1 1886 white
1 1333 white
1 771 white
1 441 white
1 1876 white
1 2215 white
1 250 red
2 1328 red
2 610 white
2 1333 white
2 40 red
2 2112 white
2 1907 white
2 1524 white
2 2347 white
2 1470 white
2 1450 red
2 212 red
2 589 red
2 2175 white
2 2070 white
2 725 white
2 1850 white
2 1199 white
2 2389 white
2 553 white
2 1261 white
2 2351 red
2 1388 white
2 222 red
2 2099 white
2 395 red
2 1736 white
2 2229 white
2 1910 white
2 1463 red
2 2282 red
2 407 red
2 986 red
2 1722 white
2 2088 white
2 741 white
2 2305 red
2 1951 white
2 474 red
2 1098 red
2 1476 red
2 542 white
2 2189 white
2 2103 white
2 289 red
2 1567 red
2 1126 white
2 1851 red
2 2192 red
2 979 white
2 953 red
3 1088 red
3 330 white
3 2369 red
3 1724 white
3 1190 red
3 1861 white
3 227 red
3 845 white
3 1425 red
3 1528 white
3 57 white
3 34 white
3 1998 red
3 1906 red
3 1504 red
3 1172 red
3 581 red
3 2061 white
3 2281 white
3 339 red
3 842 red
3 2057 red
3 1829 white
3 738 white
3 800 red
3 1268 white
3 655 white
3 166 white
3 2194 white
3 1748 red
3 2388 white
3 38 white
3 496 red
3 2285 white
3 739 white
3 1482 red
3 2136 red
3 1432 white
3 1757 white
3 1257 white
3 247 red
3 1990 white
3 625 white
3 380 red
3 2266 red
3 2242 white
3 2282 red
3 1941 white
3 933 white
3 1957 red
4 1963 red
4 232 red
4 2051 white
4 1087 white
4 527 red
4 294 white
4 1731 white
4 1115 white
4 2149 white
4 344 white
4 1250 red
4 1231 white
4 339 red
4 1766 white
4 538 white
4 445 red
4 1230 white
4 1644 white
4 268 white
4 786 red
4 1532 red
4 1627 white
4 433 red
4 839 white
4 2400 white
4 2241 white
4 2220 red
4 309 white
4 2161 red
4 1529 red
4 1274 red
4 2141 white
4 2337 red
4 1551 white
4 1922 white
4 673 white
4 890 white
4 1302 white
4 2390 red
4 1093 white
4 1747 white
4 1440 white
4 286 white
4 1182 white
4 318 white
4 1719 red
4 543 red
4 1726 red
4 2 white
4 2237 white
5 1611 red
5 751 red
5 844 white
5 1002 white
5 1121 white
5 2180 white
5 2295 red
5 744 red
5 1998 red
5 775 white
5 1875 red
5 1643 white
5 523 red
5 777 white
5 274 white
5 1281 red
5 1191 white
5 1725 white
5 1504 red
5 765 white
5 525 red
5 1800 white
5 570 white
5 1206 white
5 885 white
5 2397 red
5 1052 white
5 1566 white
5 890 white
5 179 red
5 994 red
5 2096 white
5 2192 red
5 482 red
5 232 red
5 250 red
5 1523 red
5 1348 red
5 2302 red
5 1538 red
5 395 red
5 46 white
5 2341 white
5 320 white
5 2276 white
5 63 white
5 384 red
5 1519 red
5 2202 white
5 954 white
6 600 white
6 2224 red
6 1163 white
6 29 white
6 1152 white
6 1499 red
6 1031 red
6 1994 white
6 1380 white
6 100 red
6 612 red
6 1077 white
6 1459 white
6 2214 white
6 270 red
6 2084 white
6 374 white
6 2146 white
6 905 white
6 1780 red
6 913 white
6 1664 red
6 985 white
6 506 white
6 2284 red
6 1969 white
6 1275 red
6 1474 red
6 1037 red
6 232 red
6 65 white
6 2226 red
6 1421 white
6 112 white
6 78 white
6 1036 white
6 1136 white
6 1765 red
6 1347 red
6 1739 red
6 480 white
6 2366 white
6 1226 white
6 1579 red
6 1657 white
6 661 white
6 1023 white
6 1367 red
6 1273 white
6 217 white
7 1506 white
7 1720 white
7 1112 red
7 2291 white
7 2121 white
7 1679 white
7 1308 red
7 1985 white
7 930 red
7 2069 white
7 618 white
7 118 red
7 1978 white
7 1026 white
7 1102 red
7 1645 white
7 122 white
7 426 white
7 963 white
7 1197 white
7 74 red
7 199 white
7 2039 white
7 2018 white
7 974 white
7 1524 white
7 2344 red
7 1758 white
7 1669 white
7 1131 white
7 1733 white
7 1897 white
7 828 white
7 1343 white
7 319 white
7 807 red
7 2061 white
7 14 white
7 1024 red
7 1357 white
7 2354 white
7 129 red
7 1984 white
7 42 white
7 1861 white
7 674 red
7 1942 white
7 2178 white
7 1169 white
7 1304 red
8 1199 white
8 243 red
8 273 red
8 2092 red
8 117 white
8 766 white
8 2376 white
8 1563 white
8 1540 white
8 267 white
8 362 red
8 825 white
8 46 white
8 1609 red
8 1383 white
8 1753 red
8 51 white
8 378 white
8 1208 red
8 2196 red
8 896 red
8 2072 white
8 467 white
8 2193 red
8 1353 red
8 2205 red
8 2120 red
8 57 white
8 1916 white
8 818 red
8 312 white
8 1317 white
8 2162 red
8 1689 red
8 777 white
8 2028 white
8 658 red
8 990 white
8 67 red
8 944 white
8 1576 white
8 1645 white
8 2148 white
8 576 red
8 1250 red
8 1850 white
8 1884 red
8 920 red
8 1354 red
8 2099 white
9 2170 white
9 1107 white
9 776 white
9 2084 white
9 1503 white
9 2290 white
9 1034 red
9 1695 white
9 1933 red
9 900 white
9 2227 red
9 677 white
9 2111 white
9 1929 red
9 1031 red
9 2284 red
9 1127 white
9 2308 white
9 1609 red
9 354 red
9 189 white
9 2254 white
9 359 white
9 1725 white
9 2376 white
9 986 red
9 1308 red
9 921 red
9 1967 white
9 574 white
9 1321 white
9 2001 red
9 224 white
9 975 white
9 578 white
9 2076 red
9 2229 white
9 971 red
9 1757 white
9 397 red
9 1788 red
9 244 white
9 1674 white
9 2214 white
9 1608 red
9 112 white
9 878 white
9 1928 white
9 1416 red
9 744 red
10 485 red
10 200 red
10 1556 red
10 815 red
10 1759 white
10 690 red
10 23 white
10 47 white
10 898 white
10 105 red
10 1342 white
10 1065 white
10 1772 white
10 712 white
10 2360 white
10 156 white
10 1006 white
10 1366 white
10 483 white
10 414 red
10 1355 white
10 2288 white
10 1974 red
10 209 red
10 263 white
10 1663 white
10 627 red
10 736 red
10 1668 white
10 106 white
10 1955 white
10 732 white
10 2352 white
10 305 white
10 2065 white
10 34 white
10 1519 red
10 1703 red
10 624 white
10 1089 white
10 955 red
10 293 white
10 1825 white
10 401 white
10 1719 red
10 184 white
10 189 white
10 1440 white
10 589 red
10 592 white
11 783 white
11 751 red
11 161 white
11 1538 red
11 222 red
11 971 red
11 418 red
11 1800 white
11 1433 white
11 1275 red
11 503 white
11 768 white
11 1543 white
11 1205 red
11 413 red
11 110 white
11 2043 red
11 561 white
11 2034 white
11 763 white
11 726 red
11 95 white
11 1787 white
11 1328 red
11 1388 white
11 296 white
11 2132 white
11 133 white
11 504 red
11 940 red
11 1241 white
11 2206 red
11 2110 white
11 2349 red
11 109 white
11 248 white
11 821 red
11 777 white
11 2372 white
11 1330 white
11 654 white
11 11 white
11 526 white
11 1983 white
11 166 white
11 612 red
11 1455 white
11 761 red
11 596 red
11 464 white
12 1709 red
12 232 red
12 1301 white
12 609 white
12 195 white
12 1304 red
12 804 white
12 2262 red
12 1753 red
12 1110 white
12 1394 red
12 2278 white
12 860 white
12 2150 white
12 948 white
12 2257 white
12 454 red
12 2259 red
12 172 white
12 2294 red
12 460 white
12 494 white
12 1734 red
12 2392 red
12 102 red
12 1523 red
12 1121 white
12 1993 white
12 2060 red
12 1851 red
12 1006 white
12 1139 red
12 739 white
12 1202 red
12 461 white
12 478 white
12 141 white
12 2036 white
12 1496 white
12 217 white
12 2267 white
12 1874 white
12 1210 white
12 1130 white
12 1656 white
12 306 red
12 1212 red
12 71 white
12 405 red
12 2067 red
13 600 white
13 451 red
13 1981 red
13 821 red
13 1898 white
13 1541 red
13 1676 red
13 1117 red
13 909 red
13 1024 red
13 1873 red
13 6 white
13 1829 white
13 1136 white
13 1837 white
13 2148 white
13 1781 white
13 360 red
13 199 white
13 897 white
13 2275 white
13 150 white
13 1270 white
13 1810 red
13 77 white
13 2157 red
13 458 white
13 777 white
13 1021 red
13 163 red
13 314 white
13 400 white
13 1851 red
13 2308 white
13 463 red
13 2035 white
13 622 red
13 1709 red
13 653 white
13 1921 white
13 179 red
13 1706 white
13 2146 white
13 452 red
13 530 red
13 539 white
13 1206 white
13 1127 white
13 25 red
13 1952 white
14 2064 white
14 266 red
14 2377 red
14 1772 white
14 809 white
14 1762 red
14 872 white
14 687 red
14 581 red
14 1455 white
14 727 white
14 2246 red
14 2193 red
14 1011 red
14 792 white
14 972 red
14 341 white
14 120 white
14 834 red
14 1482 red
14 1238 red
14 1417 white
14 1521 red
14 319 white
14 1443 white
14 1014 white
14 471 white
14 406 white
14 632 red
14 205 white
14 1818 red
14 1072 white
14 768 white
14 1308 red
14 670 white
14 1047 white
14 1511 white
14 2053 red
14 299 red
14 775 white
14 512 white
14 1624 red
14 405 red
14 1489 white
14 514 white
14 2062 white
14 711 white
14 543 red
14 1712 red
14 1345 white
15 1423 white
15 427 red
15 1364 white
15 374 white
15 183 red
15 2070 white
15 1799 white
15 1345 white
15 1115 white
15 1615 white
15 34 white
15 2104 white
15 1903 white
15 556 red
15 1888 red
15 1600 white
15 1038 red
15 2145 white
15 140 red
15 1192 white
15 1952 white
15 501 red
15 1164 red
15 1878 white
15 1555 red
15 1342 white
15 1056 white
15 1757 white
15 2288 white
15 1375 white
15 1053 red
15 530 red
15 1036 white
15 1225 white
15 2217 red
15 92 white
15 968 red
15 181 white
15 1982 red
15 1427 red
15 1889 red
15 1727 red
15 1335 white
15 2239 white
15 81 white
15 942 white
15 1046 red
15 352 red
15 350 white
15 1766 white
16 78 white
16 2282 red
16 1258 red
16 326 white
16 1489 white
16 676 red
16 2396 white
16 2054 white
16 1025 red
16 2024 white
16 2004 white
16 506 white
16 1664 red
16 976 white
16 675 white
16 2294 red
16 1613 white
16 973 white
16 1766 white
16 485 red
16 1042 white
16 58 white
16 2150 white
16 388 white
16 2254 white
16 449 white
16 2006 red
16 1616 red
16 876 red
16 954 white
16 2145 white
16 999 red
16 1070 white
16 2350 white
16 504 red
16 55 white
16 1923 white
16 929 red
16 2340 red
16 2171 white
16 1176 red
16 2041 white
16 103 white
16 1946 white
16 401 white
16 2031 white
16 375 white
16 2204 white
16 1180 white
16 369 white
17 382 white
17 427 red
17 28 white
17 1690 white
17 125 white
17 499 red
17 1109 white
17 65 white
17 1860 white
17 1626 white
17 1432 white
17 218 white
17 760 white
17 1569 white
17 856 white
17 1605 red
17 55 white
17 450 red
17 550 red
17 1997 red
17 1656 white
17 2114 white
17 1538 red
17 1207 white
17 1298 white
17 431 white
17 345 white
17 569 red
17 2031 white
17 2120 red
17 2020 red
17 1423 white
17 787 red
17 1355 white
17 833 red
17 208 red
17 867 white
17 2111 white
17 1284 white
17 2156 white
17 1755 red
17 2358 white
17 220 red
17 1070 white
17 2155 white
17 1291 white
17 881 red
17 2047 white
17 1308 red
17 1239 white
18 1451 red
18 1883 red
18 1760 red
18 2378 red
18 1629 white
18 2369 red
18 875 white
18 833 red
18 1023 white
18 1499 red
18 156 white
18 744 red
18 279 white
18 425 red
18 1949 white
18 1501 white
18 1735 red
18 917 white
18 386 white
18 1665 red
18 281 white
18 1191 white
18 1577 white
18 868 red
18 2314 white
18 1808 red
18 1913 white
18 1653 white
18 940 red
18 1558 red
18 1038 red
18 1897 white
18 1916 white
18 283 red
18 1981 red
18 742 white
18 460 white
18 1573 white
18 96 white
18 1363 red
18 2387 white
18 1348 red
18 304 white
18 282 white
18 1218 white
18 769 white
18 83 white
18 1093 white
18 1461 red
18 2291 white
19 2200 white
19 1153 white
19 13 white
19 964 white
19 2280 white
19 128 white
19 2329 white
19 1415 red
19 775 white
19 2127 white
19 1076 white
19 1262 white
19 1588 red
19 1552 white
19 313 white
19 2214 white
19 1375 white
19 1593 white
19 1303 red
19 2059 white
19 413 red
19 1068 red
19 1748 red
19 1998 red
19 2012 red
19 39 white
19 649 white
19 1719 red
19 1704 red
19 1854 white
19 64 red
19 1135 white
19 731 white
19 36 white
19 331 red
19 577 white
19 1570 white
19 2266 red
19 1908 white
19 1402 white
19 512 white
19 894 white
19 2301 red
19 102 red
19 911 white
19 427 red
19 1653 white
19 810 white
19 1562 red
19 994 red
20 875 white
20 124 white
20 666 white
20 246 red
20 582 white
20 2207 white
20 1217 red
20 111 white
20 1866 white
20 1278 white
20 818 red
20 1359 red
20 1091 red
20 1731 white
20 1072 white
20 1443 white
20 1813 white
20 1059 white
20 1759 white
20 1457 white
20 1002 white
20 1294 white
20 247 red
20 1291 white
20 1724 white
20 843 white
20 2172 red
20 913 white
20 1971 white
20 206 white
20 2340 red
20 192 white
20 1670 white
20 458 white
20 1733 white
20 182 white
20 2074 red
20 19 white
20 1778 white
20 1119 red
20 1686 white
20 1896 red
20 1364 white
20 646 white
20 508 red
20 2058 red
20 1452 red
20 562 white
20 1680 white
20 1984 white
21 923 white
21 229 white
21 1242 red
21 252 white
21 744 red
21 1473 red
21 732 white
21 1013 white
21 1896 red
21 1316 red
21 1166 white
21 1487 red
21 702 white
21 1663 white
21 1389 red
21 2065 white
21 1041 red
21 683 red
21 1764 red
21 580 white
21 1972 red
21 945 white
21 175 red
21 204 white
21 535 white
21 92 white
21 928 white
21 1870 white
21 1059 white
21 1283 red
21 1719 red
21 1610 red
21 226 white
21 708 white
21 1359 red
21 255 white
21 1455 white
21 1784 white
21 1742 red
21 884 white
21 1490 white
21 214 white
21 209 red
21 751 red
21 1270 white
21 1738 red
21 1482 red
21 2198 white
21 2027 red
21 298 red
22 1955 white
22 1591 white
22 489 red
22 1793 white
22 541 white
22 1100 white
22 220 red
22 2362 red
22 1833 white
22 603 white
22 1628 red
22 863 white
22 865 white
22 1047 white
22 650 white
22 510 white
22 1726 red
22 378 white
22 1417 white
22 735 red
22 879 white
22 1016 red
22 1846 red
22 912 red
22 567 white
22 628 white
22 1030 white
22 371 white
22 1060 white
22 1056 white
22 679 white
22 533 white
22 911 white
22 473 white
22 1584 white
22 805 white
22 469 red
22 1899 white
22 1281 red
22 1553 white
22 1081 white
22 1461 red
22 839 white
22 1120 white
22 620 red
22 151 red
22 2246 red
22 278 white
22 1704 red
22 2304 red
23 1872 white
23 2071 red
23 580 white
23 1225 white
23 2003 red
23 2011 red
23 918 white
23 530 red
23 505 red
23 105 red
23 627 red
23 823 white
23 774 white
23 959 red
23 2142 red
23 990 white
23 731 white
23 470 red
23 1496 white
23 1885 red
23 1473 red
23 2087 red
23 2364 white
23 932 white
23 840 white
23 1456 red
23 1489 white
23 485 red
23 1938 white
23 36 white
23 276 red
23 622 red
23 540 white
23 598 white
23 328 white
23 936 white
23 1297 white
23 2312 white
23 1296 red
23 2242 white
23 1732 white
23 1151 white
23 1663 white
23 1724 white
23 232 red
23 721 white
23 1969 white
23 2335 white
23 546 red
23 1900 white
24 1767 red
24 2076 red
24 2392 red
24 759 red
24 551 red
24 1697 red
24 1377 white
24 1086 red
24 1711 white
24 2368 white
24 1410 white
24 348 red
24 1848 white
24 961 white
24 642 white
24 1421 white
24 1726 red
24 1120 white
24 925 red
24 615 white
24 2339 white
24 629 red
24 228 white
24 1155 white
24 1603 white
24 947 white
24 29 white
24 139 white
24 2110 white
24 254 white
24 2189 white
24 2270 white
24 1328 red
24 1755 red
24 2250 red
24 2372 white
24 2044 white
24 536 red
24 1919 red
24 1197 white
24 213 white
24 549 white
24 1803 red
24 853 white
24 864 red
24 2059 white
24 924 red
24 2121 white
24 2259 red
24 1566 white
25 1480 white
25 271 white
25 1009 white
25 644 white
25 1071 red
25 2315 white
25 1602 red
25 955 red
25 1530 white
25 1865 white
25 1513 red
25 2177 white
25 976 white
25 1719 red
25 640 white
25 1297 white
25 358 white
25 1631 red
25 1340 white
25 2345 white
25 691 red
25 862 red
25 1238 red
25 350 white
25 636 red
25 1342 white
25 1338 red
25 971 red
25 238 red
25 1469 white
25 959 red
25 1718 white
25 1126 white
25 63 white
25 2195 white
25 455 white
25 2079 red
25 802 red
25 1465 red
25 2084 white
25 898 white
25 1001 white
25 1877 white
25 1707 white
25 700 red
25 652 white
25 1641 red
25 1403 white
25 1659 white
25 1117 red
26 1375 white
26 2244 white
26 2380 red
26 2228 red
26 1854 white
26 2265 red
26 35 white
26 612 red
26 488 white
26 998 red
26 940 red
26 850 red
26 1075 white
26 1449 red
26 909 red
26 309 white
26 1952 white
26 1986 red
26 1984 white
26 1865 white
26 2293 white
26 1428 red
26 482 red
26 1846 red
26 1272 white
26 243 red
26 134 red
26 863 white
26 2324 white
26 1170 white
26 135 red
26 452 red
26 1254 white
26 982 white
26 1559 white
26 1492 white
26 1508 white
26 1631 red
26 1767 red
26 997 white
26 931 white
26 253 white
26 1571 white
26 1212 red
26 1489 white
26 2164 white
26 883 red
26 463 red
26 667 white
26 544 red
27 10 white
27 816 red
27 992 white
27 325 white
27 1803 red
27 1022 white
27 2364 white
27 733 white
27 1416 red
27 1217 red
27 1001 white
27 1679 white
27 1949 white
27 686 red
27 2300 white
27 1681 red
27 20 white
27 870 white
27 1594 red
27 1796 white
27 218 white
27 395 red
27 180 white
27 672 white
27 606 red
27 892 white
27 948 white
27 653 white
27 1816 white
27 872 white
27 174 red
27 1973 white
27 1374 red
27 1085 white
27 1190 red
27 2011 red
27 1788 red
27 1569 white
27 1266 white
27 1704 red
27 1076 white
27 126 red
27 793 white
27 172 white
27 945 white
27 938 red
27 755 red
27 1477 red
27 2075 white
27 1669 white
28 1303 red
28 68 white
28 512 white
28 2181 white
28 2271 red
28 1475 white
28 1917 red
28 1597 red
28 1042 white
28 947 white
28 883 red
28 180 white
28 814 red
28 582 white
28 2305 red
28 2285 white
28 391 red
28 878 white
28 61 white
28 1430 red
28 1041 red
28 824 red
28 508 red
28 2144 white
28 942 white
28 238 red
28 1368 white
28 2352 white
28 510 white
28 1805 white
28 2365 red
28 214 white
28 1696 red
28 749 white
28 2099 white
28 857 white
28 409 white
28 106 white
28 628 white
28 1439 red
28 2171 white
28 1694 white
28 1905 white
28 1712 red
28 522 red
28 2126 white
28 1477 red
28 1631 red
28 449 white
28 1335 white
29 908 white
29 1778 white
29 194 white
29 556 red
29 424 white
29 1394 red
29 1194 red
29 1044 white
29 2317 white
29 969 white
29 673 white
29 991 white
29 39 white
29 2215 white
29 505 red
29 870 white
29 1997 red
29 155 red
29 434 white
29 1602 red
29 1430 red
29 1156 white
29 822 red
29 1336 white
29 1356 red
29 1527 white
29 632 red
29 836 red
29 1806 red
29 1880 red
29 1607 white
29 2002 white
29 1698 white
29 1966 red
29 89 red
29 1072 white
29 1359 red
29 1509 white
29 187 white
29 1564 white
29 2237 white
29 384 red
29 678 white
29 1957 red
29 123 white
29 1629 white
29 345 white
29 92 white
29 803 white
29 90 white
30 1908 white
30 472 white
30 2013 white
30 2306 white
30 216 red
30 1171 red
30 1297 white
30 497 white
30 536 red
30 412 red
30 1299 red
30 2396 white
30 2221 white
30 937 red
30 678 white
30 1006 white
30 1740 white
30 490 white
30 1457 white
30 1423 white
30 1319 white
30 918 white
30 549 white
30 1756 red
30 276 red
30 2192 red
30 804 white
30 844 white
30 2025 red
30 1172 red
30 1596 red
30 651 white
30 777 white
30 2397 red
30 1622 red
30 491 red
30 2210 white
30 2166 white
30 2027 red
30 410 red
30 2160 white
30 1120 white
30 560 white
30 1959 red
30 981 red
30 477 white
30 1240 white
30 56 white
30 1898 white
30 1741 red

Observe that while the first 50 rows of replicate are equal to 1, the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 correspond to the second sample of 50 balls. This pattern continues for all reps = 30 replicates and thus virtual_samples has \(30 \times 50 = 1500\) rows.
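
The row structure described above can be checked with a small base R sketch. It does not use rep_sample_n(); instead it assumes a hypothetical toy bowl (toy_colors, with an arbitrary number of red balls) and stacks 30 samples of 50 balls by hand.

```r
# Hypothetical toy bowl with an assumed composition (not the real bowl)
set.seed(2)
toy_colors <- rep(c("red", "white"), times = c(960, 1440))

reps <- 30   # number of replicates
size <- 50   # balls per replicate

# Draw each replicate without replacement and stack the results row-wise
toy_samples <- do.call(rbind, lapply(seq_len(reps), function(r) {
  ids <- sample(length(toy_colors), size = size)
  data.frame(replicate = r, ball_ID = ids, color = toy_colors[ids])
}))

nrow(toy_samples)              # reps * size rows in total
table(toy_samples$replicate)   # each replicate contributes `size` rows
```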

virtual_prop_red <- virtual_samples %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

virtual_prop_red
# A tibble: 30 x 3
   replicate   red prop_red
       <int> <int>    <dbl>
 1         1    20     0.4 
 2         2    21     0.42
 3         3    23     0.46
 4         4    18     0.36
 5         5    24     0.48
 6         6    19     0.38
 7         7    12     0.24
 8         8    23     0.46
 9         9    19     0.38
10        10    16     0.32
# … with 20 more rows
#kable(virtual_prop_red) # To see all 30 samples

Let’s visualize the distribution of these 30 proportions red, based on the 30 virtual samples, using a histogram with binwidth = 0.05.

ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white", fill = "steelblue") +
  labs(x = "Proportion of 50 balls that were red", 
       title = "Distribution of 30 proportions red") 

Observe that occasionally we obtained proportions red that are less than ____, while on the other hand we occasionally obtained proportions that are greater than ____. However, the most frequently occurring proportions red out of 50 balls were between ____ % and ____ % (for ___ out of 30 samples). Why do we have these differences in proportions red? Because of ___________________.

Exercise 1.1 Redo the above activity with 1000 repeated samples and state your conclusions.

1.2.1 Using different shovels

Suppose we now have three shovels, with 25, 50, and 100 slots. If your goal were still to estimate the proportion of the bowl’s balls that are red, which shovel would you choose? Why? Let’s try to answer these questions.

# Segment 1: sample size = 25 ------------------------------
# 1.a) Virtually use shovel 1000 times
virtual_samples_25 <- bowl %>% 
  rep_sample_n(size = 25, reps = 1000)

# 1.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_25 <- virtual_samples_25 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 25)

# 1.c) Plot distribution via a histogram
p1 <- ggplot(virtual_prop_red_25, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 25 balls that were red", title = "25") 

# Segment 2: sample size = 50 ------------------------------
# 2.a) Virtually use shovel 1000 times
virtual_samples_50 <- bowl %>% 
  rep_sample_n(size = 50, reps = 1000)

# 2.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_50 <- virtual_samples_50 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

# 2.c) Plot distribution via a histogram
p2 <- ggplot(virtual_prop_red_50, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 50 balls that were red", title = "50")  

# Segment 3: sample size = 100 ------------------------------
# 3.a) Virtually using shovel with 100 slots 1000 times
virtual_samples_100 <- bowl %>% 
  rep_sample_n(size = 100, reps = 1000)

# 3.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_100 <- virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 100)

# 3.c) Plot distribution via a histogram
p3 <- ggplot(virtual_prop_red_100, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 100 balls that were red", title = "100") 


# plot_grid() is provided by the cowplot package
plot_grid(p1, p2, p3, nrow = 1)

Observe that as the sample size increases, the ______ of the 1000 replicates of the proportion red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing the above Figure, things appear to center tightly around roughly ____%.

# n = 25
virtual_prop_red_25 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0957
# n = 50
virtual_prop_red_50 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0702
# n = 100
virtual_prop_red_100 %>% 
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
      sd
   <dbl>
1 0.0486
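
Number of slots in shovel   Standard deviation of proportions red
25                          0.0978
50                          0.0669
100                         0.0471

(The values in this table appear to come from a separate run of the simulation, so they differ slightly from the output printed above; each run draws new random samples.)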

As the sample size increases our numerical measure of spread decreases; there is less variation in our proportions red. In other words, as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more consistent and precise.
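
The same pattern can be reproduced with a short base R sketch. It assumes a hypothetical toy bowl (toy_colors, with an arbitrary number of red balls) and a helper sd_of_props() that is not part of moderndive; it only mirrors the three segments above in compact form.

```r
set.seed(3)
# Hypothetical toy bowl with an assumed composition (not the real bowl)
toy_colors <- rep(c("red", "white"), times = c(960, 1440))

# Standard deviation of `reps` sample proportions for a shovel with n slots
sd_of_props <- function(n, reps = 1000) {
  props <- replicate(reps, mean(sample(toy_colors, size = n) == "red"))
  sd(props)
}

# One standard deviation per shovel size
spread <- sapply(c(25, 50, 100), sd_of_props)
names(spread) <- c("n = 25", "n = 50", "n = 100")
spread
```

The exact numbers depend on the seed and on the assumed composition, but the ordering (largest spread for n = 25, smallest for n = 100) is what matters here.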

1.3 What did we learn?

This was our first attempt at understanding two key concepts relating to sampling for estimation:

  1. The effect of sampling variation on our estimates.
  2. The effect of sample size on sampling variation.

Let’s now introduce some terminology and notation as well as statistical definitions related to sampling.

1.4 Terminology & notation

  1. (Study) Population: A (study) population is a collection of individuals or observations about which we are interested. We mathematically denote the population’s size using upper case N. In our simulations the (study) population was the collection of N = 2400 identically sized red and white balls contained in the bowl.

  2. Population parameter: A population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the population mean which is mathematically denoted with the Greek letter \(\mu\) (pronounced “mu”). In our simulations however since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the population proportion which is mathematically denoted with the letter \(p\).

  3. Census: An exhaustive enumeration or counting of all \(N\) individuals or observations in the population in order to compute the population parameter’s value exactly. In our simulations, this would correspond to manually going over all \(N = 2400\) balls in the bowl and counting the number that are red and computing the population proportion \(p\) of the balls that are red exactly. When the number \(N\) of individuals or observations in our population is large, as was the case with our bowl, a census can be very expensive in terms of time, energy, and money.

  4. Sampling: Sampling is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case \(n\), as opposed to upper case \(N\) which denotes the population’s size. Typically the sample size \(n\) is much smaller than the population size \(N\), thereby making sampling a much cheaper procedure than a census. In our simulations, we used shovels with 25, 50, and 100 slots to extract a sample of size \(n = 25\), \(n = 50\), and \(n = 100\) balls.

  5. Point estimate (AKA sample statistic): A summary statistic computed from the sample that estimates the unknown population parameter. In our simulations, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with \(p\). Our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using \(\hat{p}\); the “hat” on top of the \(p\) indicates that it is an estimate of the unknown population proportion \(p\).

  6. Representative sampling: A sample is said to be a representative sample if it is representative of the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our simulations, are the samples of \(n\) balls extracted using our shovels representative of the bowl’s \(N = 2400\) balls?

  7. Generalizability: We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, can the value of the point estimate be generalized to estimate the value of the population parameter well? In our simulations, can we generalize the values of the sample proportions red of our shovels to the population proportion red of the bowl? Using mathematical notation, is \(\hat{p}\) a “good guess” of \(p\)?

  8. Bias: In a statistical sense, we say bias occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population has an equal chance of being sampled. In our simulations, since each ball had the same size and hence an equal chance of being sampled by our shovels, our samples were unbiased.

  9. Random sampling: We say a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our simulations, this would correspond to sufficiently mixing the bowl before each use of the shovel.
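
To make the idea of bias in item 8 concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: the toy population has an arbitrary composition, and the biased procedure (red balls three times as likely to be picked) is hypothetical and does not correspond to anything in the bowl activity.

```r
set.seed(5)
# Hypothetical population with an assumed composition (not the real bowl)
population <- rep(c("red", "white"), times = c(960, 1440))
p <- mean(population == "red")   # population proportion

# Unbiased: every ball has the same chance of being drawn
unbiased <- replicate(1000, mean(sample(population, 50) == "red"))

# Biased (hypothetical): each red ball is three times as likely to be picked
w <- ifelse(population == "red", 3, 1)
biased <- replicate(1000, mean(sample(population, 50, prob = w) == "red"))

# Average sample proportion under each procedure, next to the true p
c(p = p, unbiased = mean(unbiased), biased = mean(biased))
```

Under the unbiased procedure the sample proportions average out close to p, while the biased procedure systematically overshoots it.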

Let’s put them all together:

  • If we extract a sample of \(n=50\) balls at random, in other words we mix the equally-sized balls before using the shovel, then

  • the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus

  • any result based on the sample of balls can generalize to the bowl, thus

  • the sample proportion \(\hat{p}\) of the \(n=50\) balls in the shovel that are red is a “good guess” of the population proportion \(p\) of the \(N =2400\) balls that are red, thus

  • instead of manually going over all the balls in the bowl, we can infer about the bowl using the shovel.
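
The chain of reasoning above can be sketched in a few lines of R. This is only an illustration: the vector `bowl_sketch` is a hypothetical stand-in for the bowl, and the count of 900 red balls is an assumption made for the example. Because we build the stand-in ourselves, we can compare the point estimate \(\hat{p}\) with the population proportion \(p\), which would normally be unknown.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the bowl: N = 2400 balls, 900 assumed to be red
bowl_sketch <- c(rep("red", 900), rep("white", 1500))
N <- length(bowl_sketch)

# Population parameter p: normally unknown, known here because we built the bowl
p <- mean(bowl_sketch == "red")

# One use of the shovel: n = 50 balls drawn at random, without replacement
n <- 50
shovel <- sample(bowl_sketch, size = n)

# Point estimate p-hat: proportion of the shovel's balls that are red
p_hat <- mean(shovel == "red")

c(p = p, p_hat = p_hat)
```

Running the last few lines again (without resetting the seed) gives a different \(\hat{p}\) each time, which is exactly the sampling variation discussed in this chapter.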

Definition 1.1 The sampling distribution of a statistic (e.g. mean, median, proportion) is the probability distribution of the values that statistic takes across all possible samples of a given size.

Definition 1.2 The standard deviation of a sampling distribution is called the standard error.

Example: This is the same table as above, but notice the 2nd column name.

Number of slots in shovel    Standard error of proportions red
25                           0.0978
50                           0.0669
100                          0.0471

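
Standard errors like the ones in this table can be estimated by simulation. The sketch below is an illustration under two assumptions that do not come from the text: a hypothetical stand-in bowl (`bowl_sketch`) with 900 red balls out of 2400, and 1000 repetitions per shovel size. The helper name `standard_error` is also made up for this example, and the printed values should be close to the table but will not match it exactly.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in bowl: 2400 balls, 900 assumed to be red
bowl_sketch <- c(rep("red", 900), rep("white", 1500))

# Standard error = standard deviation of p-hat over many repeated samples
standard_error <- function(n, reps = 1000) {
  p_hats <- replicate(reps, mean(sample(bowl_sketch, size = n) == "red"))
  sd(p_hats)
}

shovel_sizes <- c(25, 50, 100)
se <- sapply(shovel_sizes, standard_error)
data.frame(shovel_sizes, se)
```

The standard error shrinks as the number of slots in the shovel grows.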
Exercise 1.2 Find and plot the sampling distribution of the proportion (\(\hat{p}\)) of heads when you flip a fair coin. (Use 5000 sets of 10 tosses)
  1. What is the sample size?

  2. How many experiments?

  3. Find and plot the sampling distribution of \(\hat{p}\).

  4. Find the standard error of the sampling distribution of \(\hat{p}\).

  5. What happens to the standard error when you increase the sample size?


1.5 DETOUR — Some brand name distributions

1.5.1 Normal Distribution

The normal distribution is defined by the following probability density function, where \(\mu\) is the population mean and \(\sigma\) is the standard deviation.

\[f(x) = \dfrac{1}{\sigma \sqrt{2 \pi}}e^{-(x-\mu)^2/{2\sigma^2}}\]

If a random variable \(X\) follows the normal distribution, then we write: \(X \sim N(\mu, \sigma^2)\)
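
To connect the formula with R, the density can be coded directly and compared with the built-in `dnorm`. This is a minimal sketch: the helper name `normal_density` is made up for the example, and \(N(15, 9)\) (mean 15, variance 9) is just a convenient choice. Note that `dnorm` takes the standard deviation `sd`, not the variance.

```r
# Hand-coded normal density, written directly from the formula for f(x)
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(9, 12, 15, 18, 21)

by_hand <- normal_density(x, mu = 15, sigma = 3)
built_in <- dnorm(x, mean = 15, sd = 3)   # sd = 3 because the variance is 9

all.equal(by_hand, built_in)
```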

Here is what the normal density looks like, for example when \(X \sim N(0, 1)\):

# Ignore this code

p1 <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
  scale_y_continuous(breaks = NULL)
p1

R functions (package: stats): dnorm (density), pnorm (cumulative probability), qnorm (quantile), rnorm (random sample)

Examples:

  1. Generate a random sample of 100 from \(N(15, 9)\) and create a histogram.
x <- rnorm(n = 100, mean = 15, sd = 3) # sd is the square root of the variance, 9

ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 1.5)

  2. If \(X \sim N(15, 9)\), find the probability that \(X\) is greater than 21: \(P(X > 21)\)
1 - pnorm(21, mean = 15, sd = 3) # pnorm gives the left-tail area up to 21, so subtract it from 1
[1] 0.02275013
  3. If \(X \sim N(15, 9)\), find the 25th percentile (the 0.25 quantile).
qnorm(.25, mean = 15, sd = 3)
[1] 12.97653
# Check: pnorm(12.97653, mean = 15, sd = 3) returns approximately 0.25

Question:

  1. A radar unit is used to measure speeds of cars on a motorway. The speeds are normally distributed with a mean of 90 km/hr and a standard deviation of 10 km/hr. What is the probability that a car picked at random is travelling at more than 100 km/hr?

  2. GMAT scores are roughly normally distributed with a mean of 527 and a standard deviation of 112. How high must an individual score on the GMAT in order to score in the highest 5%?

1.5.2 Exponential Distribution

The exponential distribution is defined by the following probability density function, where \(\lambda > 0\) is the rate and \(\dfrac{1}{\lambda}\) is both the population mean and the population standard deviation.

\[f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0\]

If a random variable \(X\) follows the Exponential Distribution, then we write: \(X \sim Exp(\lambda)\)
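
A short sketch relating this density to R. Integrating \(f\) from 0 to \(x\) gives \(P(X \leq x) = 1 - e^{-\lambda x}\), which should agree with the built-in `pexp`, and a large simulated sample should have a mean and standard deviation close to \(1/\lambda\). The rate \(1/15\), the sample size of 100000 and the seed are arbitrary choices made for the illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

lambda <- 1 / 15          # rate, so the mean and standard deviation are both 15

# Cumulative probability from the closed form versus the built-in pexp()
x <- c(5, 10, 21)
by_hand <- 1 - exp(-lambda * x)
built_in <- pexp(x, rate = lambda)
all.equal(by_hand, built_in)

# Mean and standard deviation of a large simulated sample
draws <- rexp(n = 100000, rate = lambda)
c(mean = mean(draws), sd = sd(draws), theory = 1 / lambda)
```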

Here is what the exponential density looks like, for example when \(X \sim Exp(1/15)\):

# Ignore this code
x <- seq(0, 100, length.out=1000)
dat <- data.frame(x=x, px=dexp(x, rate=1/15))

ggplot(dat, aes(x=x, y=px)) + geom_line()

R functions (package: stats): dexp (density), pexp (cumulative probability), qexp (quantile), rexp (random sample)

Examples:

  1. Generate a random sample of 100 from \(Exp(1/15)\) and create a histogram.
x <- rexp(n = 100, rate = 1/15)

ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 5)

  2. If \(X \sim Exp(1/15)\), find the probability that \(X\) is less than 21: \(P(X < 21)\)
pexp(21, rate = 1/15) # pexp gives the left-tail area up to 21
[1] 0.753403
  3. If \(X \sim Exp(1/15)\), find the 75th percentile (the 0.75 quantile).
qexp(.75, rate = 1/15)
[1] 20.79442
# Check: pexp(20.79442, rate = 1/15) returns approximately 0.75

Question:

The number of days ahead travelers purchase their airline tickets can be modeled by an exponential distribution with the average amount of time equal to 15 days.

  1. Find the probability that a traveler will purchase a ticket fewer than ten days in advance.

  2. How many days do 80% of all travelers wait?

1.5.3 Binomial Distribution

The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is as follows.

\[f(x) = {n \choose x} p^x (1-p)^{(n-x)} \quad \text{where x = 0, 1, 2,...,n}\]

Example: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct.

  1. Find the probability of having exactly four correct answers if a student attempts to answer every question at random.

  2. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Solution:

Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2.

  1. By hand:

\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} = 0.1329\)

In R:

dbinom(4, size=12, prob=0.2) 
[1] 0.1328756
  2. By hand:

\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} + {12 \choose 3} 0.2^3 (1-0.2)^{(12-3)} + {12 \choose 2} 0.2^2 (1-0.2)^{(12-2)} + {12 \choose 1} 0.2^1 (1-0.2)^{(12-1)} + {12 \choose 0} 0.2^0 (1-0.2)^{(12-0)} = 0.9274\)

In R:

OR Alternatively,
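
Both computations can be sketched as follows: first by summing the individual probabilities \(P(X = 0), \dots, P(X = 4)\) from `dbinom`, and then with a single call to the cumulative function `pbinom`. This is a minimal illustration for the quiz example (12 questions, success probability 0.2); the variable names are made up for the sketch.

```r
# Probability of four or fewer correct answers out of 12, with p = 0.2

# Route 1: add up P(X = 0), ..., P(X = 4) using the probability mass function
sum_of_terms <- sum(dbinom(0:4, size = 12, prob = 0.2))

# Route 2: the cumulative distribution function gives P(X <= 4) directly
cumulative <- pbinom(4, size = 12, prob = 0.2)

c(sum_of_terms = sum_of_terms, cumulative = cumulative)
```

Both routes return the same probability, which agrees with the hand calculation of 0.9274 to four decimal places.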

END of the DETOUR!


1.6 Theoretical Sampling Distribution of the Proportion

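For a sufficiently large sample size \(n\), the sampling distribution of \(\hat{p}\) is approximately normal with mean \(p\) (the actual population proportion) and standard deviation (the standard error) \(\sqrt{p(1-p)/n}\). In the \(N(\mu, \sigma^2)\) notation used above, where the second argument is the variance:

\[\hat{p} \sim N\left( p, \dfrac{p(1-p)}{n} \right)\]

Let’s revisit the coin flip example. What we did there was a simulation. Now we can check how close our simulation results are to the theoretical results.

Let’s find the theoretical mean and standard deviation of the sampling distribution of the proportion:

Here \(p = 0.5\), \(n = 10\)

Therefore \(\hat{p} \sim N\left( p, \dfrac{p(1-p)}{n} \right) = N\left( 0.5, \dfrac{0.5(1-0.5)}{10} \right) = N( 0.5, 0.025)\), so the theoretical mean is \(0.5\) and the theoretical standard error is \(\sqrt{0.025} = 0.1581139\).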
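
A minimal sketch of that comparison in R, assuming the coin flips are simulated with `rbinom` as 5000 sets of 10 tosses (the seed and the variable names are arbitrary choices made for the illustration):

```r
# Simulate the coin-flip sampling distribution and compare it with theory
set.seed(2024)            # arbitrary seed, only for reproducibility

p <- 0.5                  # probability of heads for a fair coin
n <- 10                   # tosses per set (the sample size)
reps <- 5000              # number of sets (the number of experiments)

# Each rbinom() draw is the number of heads in n tosses; dividing by n gives p-hat
p_hat <- rbinom(reps, size = n, prob = p) / n

simulated_mean <- mean(p_hat)
simulated_se <- sd(p_hat)

theoretical_mean <- p
theoretical_se <- sqrt(p * (1 - p) / n)

c(simulated = simulated_mean, theoretical = theoretical_mean)
c(simulated = simulated_se, theoretical = theoretical_se)
```

The simulated mean and standard error should land close to the theoretical values, with small differences that come from using a finite number of sets.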

1.7 Sampling Distribution of the Mean